Conversation
Pull request overview
This PR adds initial support for the OpenAI Responses API to OVMS LLM serving, wiring a new /v3/responses endpoint through request parsing and unary response serialization, with accompanying unit tests.
Changes:
- Introduces a new `Endpoint::RESPONSES` and routes `/v3/responses` + `/v3/v1/responses` to it.
- Adds Responses-specific request parsing (`input`, `max_output_tokens`, conflict check with `max_completion_tokens`) and unary response JSON serialization (`object: "response"`, `output: [...]`).
- Extends unit tests to cover Responses parsing and unary response structure.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/test/http_openai_handler_test.cpp | Adds unit tests validating Responses parsing and unary response output structure. |
| src/llm/servable.cpp | Routes /v3/responses to the new endpoint and prepares inputs for it. |
| src/llm/apis/openai_completions.hpp | Adds Endpoint::RESPONSES and declares parseResponsesPart(). |
| src/llm/apis/openai_completions.cpp | Implements Responses parsing and a new unary serialization path for Responses output format. |
Comments suppressed due to low confidence (2)
src/llm/apis/openai_completions.cpp:826
- For RESPONSES requests, the validation error message says "max_tokens value should be greater than 0", but the endpoint-specific fields are `max_output_tokens`/`max_completion_tokens`. Updating the message to reference the actual accepted field(s) will make it clearer for API consumers.

```cpp
// specific part of max_tokens validation
if (request.maxTokens == 0) {
    return absl::InvalidArgumentError("max_tokens value should be greater than 0");
}
```
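To make the message endpoint-aware, a minimal sketch (the enum value exists in this PR; the helper name and exact wording here are hypothetical):

```cpp
#include <string>

enum class Endpoint { CHAT_COMPLETIONS,
    COMPLETIONS,
    RESPONSES };

// Hypothetical helper: report the field name the endpoint actually accepts,
// so Responses API consumers are not pointed at "max_tokens".
std::string maxTokensZeroError(Endpoint endpoint) {
    const char* field = (endpoint == Endpoint::RESPONSES)
                            ? "max_output_tokens / max_completion_tokens"
                            : "max_tokens";
    return std::string(field) + " value should be greater than 0";
}
```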
src/llm/servable.cpp:76
- The invalid-endpoint error message lists `/v3/tokenize`, but `TokenizeParser::isTokenizeEndpoint()` accepts any URI that ends with `tokenize` (including `/v3/v1/tokenize`). Consider listing both `/v3/tokenize` and `/v3/v1/tokenize` (consistent with the other endpoints) or wording it as a suffix-based rule to avoid misleading users.

```cpp
return absl::InvalidArgumentError("Wrong endpoint. Allowed endpoints: /v3/chat/completions, /v3/completions, /v3/responses, /v3/tokenize");
```
src/llm/apis/openai_completions.cpp
Outdated
```cpp
std::optional<uint32_t> maxCompletionTokens;
std::optional<uint32_t> maxOutputTokens;

// max_completion_tokens: uint; optional
it = doc.FindMember("max_completion_tokens");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsUint()) {
        if (it->value.IsUint64())
            return absl::InvalidArgumentError("max_completion_tokens value can't be greater than 4294967295");
        return absl::InvalidArgumentError("max_completion_tokens is not an unsigned integer");
    }
    if (maxTokensLimit.has_value() && it->value.GetUint() > maxTokensLimit.value())
        return absl::InvalidArgumentError(absl::StrCat("max_completion_tokens exceeds limit provided in graph config: ", maxTokensLimit.value()));
    maxCompletionTokens = it->value.GetUint();
}

// max_output_tokens: uint; optional
// OpenAI Responses API uses this field for output token limit.
it = doc.FindMember("max_output_tokens");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsUint()) {
        if (it->value.IsUint64())
            return absl::InvalidArgumentError("max_output_tokens value can't be greater than 4294967295");
        return absl::InvalidArgumentError("max_output_tokens is not an unsigned integer");
    }
    if (maxTokensLimit.has_value() && it->value.GetUint() > maxTokensLimit.value())
        return absl::InvalidArgumentError(absl::StrCat("max_output_tokens exceeds limit provided in graph config: ", maxTokensLimit.value()));
    maxOutputTokens = it->value.GetUint();
}

if (maxCompletionTokens.has_value() && maxOutputTokens.has_value() && maxCompletionTokens.value() != maxOutputTokens.value()) {
    return absl::InvalidArgumentError("max_output_tokens and max_completion_tokens must match when both are provided");
}
if (maxOutputTokens.has_value()) {
    request.maxTokens = maxOutputTokens.value();
} else if (maxCompletionTokens.has_value()) {
    request.maxTokens = maxCompletionTokens.value();
}
```
OpenAIChatCompletionsRequest::maxTokens is std::optional<int>, but in RESPONSES parsing you accept max_output_tokens/max_completion_tokens as uint32_t and then assign into request.maxTokens. Values > INT_MAX will overflow / become implementation-defined. Consider changing maxTokens to an unsigned/wider type (e.g., uint32_t/uint64_t) or clamp + validate against std::numeric_limits<int>::max() before assigning.
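A hedged sketch of the clamp-and-validate option (the field names mirror those discussed above; the helper itself is hypothetical, not OVMS code):

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical guard: only assign a parsed uint32_t token limit into a
// std::optional<int> field when it fits; otherwise signal the caller to
// return an InvalidArgumentError instead of silently overflowing.
bool assignMaxTokens(uint32_t parsed, std::optional<int>& maxTokens) {
    if (parsed > static_cast<uint32_t>(std::numeric_limits<int>::max()))
        return false;
    maxTokens = static_cast<int>(parsed);
    return true;
}
```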
```cpp
if (request.maxTokens.has_value()) {
    writer.String("max_output_tokens");
    writer.Int(request.maxTokens.value());
}

writer.String("output");
writer.StartArray();
int outputIndex = 0;
for (const auto& parsedOutput : parsedOutputs) {
    const std::string outputId = "msg-" + std::to_string(outputIndex++);

    writer.StartObject();
    writer.String("id");
    writer.String(outputId.c_str());
    writer.String("type");
    writer.String("message");
    writer.String("role");
    writer.String("assistant");
    writer.String("status");
    writer.String("completed");
    writer.String("content");
    writer.StartArray();
    writer.StartObject();
    writer.String("type");
    writer.String("output_text");
    writer.String("text");
    writer.String(parsedOutput.content.c_str());
    writer.String("annotations");
    writer.StartArray();
    writer.EndArray();
    writer.EndObject();
    writer.EndArray();
    writer.EndObject();
}
writer.EndArray();

writer.String("usage");
writer.StartObject();
writer.String("input_tokens");
writer.Int(usage.promptTokens);
writer.String("input_tokens_details");
writer.StartObject();
writer.String("cached_tokens");
writer.Int(0);
writer.EndObject();
writer.String("output_tokens");
writer.Int(usage.completionTokens);
writer.String("output_tokens_details");
writer.StartObject();
writer.String("reasoning_tokens");
writer.Int(0);
writer.EndObject();
writer.String("total_tokens");
writer.Int(usage.calculateTotalTokens());
writer.EndObject();
```
serializeResponsesUnaryResponse() writes numeric fields with writer.Int(...) (e.g., max_output_tokens, input_tokens, output_tokens, total_tokens). Several of these values come from size_t or may exceed signed 32-bit range, which can overflow and produce incorrect JSON. Prefer Int64/Uint64 (or Uint) for token counters/limits to match the validated range and the underlying types.
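A small sketch of the range check that would be needed before each `writer.Int(...)` call (the helper is hypothetical; using `writer.Uint64(...)` for token counters avoids the problem entirely):

```cpp
#include <cstdint>
#include <limits>

// Demonstration of the narrowing hazard: a token counter just above INT_MAX
// no longer round-trips through int (the conversion wraps on mainstream
// compilers, and is defined as modular since C++20).
bool fitsInInt(uint64_t tokens) {
    return tokens <= static_cast<uint64_t>(std::numeric_limits<int>::max());
}
```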
@michalkulakowski can you protect against such corner cases?
src/llm/apis/openai_completions.cpp
Outdated
```cpp
const auto createdAt = std::chrono::duration_cast<std::chrono::seconds>(created.time_since_epoch()).count();
const std::string responseId = "resp-" + std::to_string(createdAt);
```
responseId is derived only from createdAt in seconds ("resp-" + std::to_string(createdAt)), which can easily collide under concurrent load (multiple requests within the same second). Use a higher-resolution timestamp and/or add a per-process counter/UUID component to keep response IDs unique.
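One possible shape for a collision-resistant ID (purely a sketch; the microsecond resolution and the per-process counter are both assumptions, not what the PR currently does):

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <string>

// Sketch: microsecond timestamp plus a per-process atomic counter, so two
// requests arriving within the same second still receive distinct IDs.
std::string makeResponseId() {
    static std::atomic<uint64_t> counter{0};
    const auto micros = std::chrono::duration_cast<std::chrono::microseconds>(
        std::chrono::system_clock::now().time_since_epoch())
                            .count();
    return "resp-" + std::to_string(micros) + "-" + std::to_string(counter.fetch_add(1));
}
```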
```md
:::{dropdown} **Unary call with curl using responses endpoint**
**Note**: using urls in request requires `--allowed_media_domains` parameter described [here](../../../docs/parameters.md)
```

Suggested change:

```md
:::{dropdown} **Unary call with cURL using Responses API**
**Note**: Using URLs in the request requires the `--allowed_media_domains` parameter described [here](../../../docs/parameters.md).
```
**Note**: using urls in request requires `--allowed_media_domains` parameter described [here](../../../docs/parameters.md)

```bash
curl http://localhost:8000/v3/responses -H "Content-Type: application/json" -d "{ \"model\": \"OpenGVLab/InternVL2-2B\", \"input\":[{\"role\": \"user\", \"content\": [{\"type\": \"input_text\", \"text\": \"Describe what is on the picture.\"},{\"type\": \"input_image\", \"image_url\": \"http://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/static/images/zebra.jpeg\"}]}], \"max_output_tokens\": 100}"
```
Can you break down the command into multiple lines to showcase what is actually being sent?
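For reference, the payload that command sends is (pretty-printed, same content):

```json
{
  "model": "OpenGVLab/InternVL2-2B",
  "input": [
    {
      "role": "user",
      "content": [
        { "type": "input_text", "text": "Describe what is on the picture." },
        { "type": "input_image", "image_url": "http://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/static/images/zebra.jpeg" }
      ]
    }
  ],
  "max_output_tokens": 100
}
```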
demos/continuous_batching/README.md
Outdated
```md
:::

### Unary calls to responses endpoint using cURL
```

Suggested change:

```md
### Unary calls via Responses API using cURL
```
```md
:::

:::{dropdown} **Streaming request with OpenAI client using responses endpoint**
```

Suggested change:

```md
:::{dropdown} **Streaming request with OpenAI client via Responses API**
```
```md
## Benchmarking text generation with high concurrency

OpenVINO Model Server employs efficient parallelization for text generation. It can also be used to generate text with high concurrency in environments shared by multiple clients.
```
Why is it being added in responses api PR? Is it related?
src/llm/apis/openai_completions.cpp
Outdated
```cpp
writer.String("input_tokens_details");
writer.StartObject();
writer.String("cached_tokens");
writer.Uint64(0);
```

Please add a note here that it is not supported.
src/llm/apis/openai_completions.cpp
Outdated
```cpp
writer.String("output_tokens_details");
writer.StartObject();
writer.String("reasoning_tokens");
writer.Uint64(0);
```

Can we support it in this release?
src/llm/apis/openai_completions.cpp
Outdated
```cpp
if (inputIt->value.IsString()) {
    request.prompt = inputIt->value.GetString();
    if (!request.prompt.has_value() || request.prompt.value().empty()) {
```

request.prompt must have a value if you set it one line earlier.
src/llm/apis/openai_completions.cpp
Outdated
```cpp
    }
    request.imageHistory.push_back({i, tensor});
} else {
    return absl::InvalidArgumentError("Unsupported content type");
```

Can you include what the content type was? That way we can easily see in the logs which unsupported content type the user sent.
src/llm/apis/openai_completions.cpp
Outdated
```cpp
if (existingMessages != doc.MemberEnd()) {
    existingMessages->value = messages;
} else {
    doc.AddMember("messages", messages, allocator);
```

Why is the document being edited? If I understand correctly, we parse the doc into our request format. Why do we need to edit the doc? Shouldn't it be left untouched, with the next steps working only on the request? Why do we parse it if we don't use it later?
src/llm/apis/openai_completions.cpp
Outdated
```cpp
auto& obj = it->value.GetArray()[i];
if (!obj.IsObject())
    return absl::InvalidArgumentError("Tool is not a JSON object");
rapidjson::Value* functionObj = nullptr;
```

functionObj is never used outside this scope, so you don't need to work with pointers here, contrary to parametersValue where it is required.
src/llm/apis/openai_completions.cpp
Outdated
```cpp
parametersValue->Accept(writer);
std::string parametersStr = buffer.GetString();
ToolSchemaWrapper schemaReprs{parametersValue, std::move(parametersStr)};
request.toolNameSchemaMap[functionNameCStr] = std::move(schemaReprs);
```

Suggested change:

```cpp
request.toolNameSchemaMap[functionName] = std::move(schemaReprs);
```
src/llm/apis/openai_completions.cpp
Outdated
```cpp
if (it->value.GetUint() == 0)
    return absl::InvalidArgumentError("n value should be greater than 0");
if (endpoint == Endpoint::RESPONSES && request.stream && it->value.GetUint() > 1)
    return absl::InvalidArgumentError("n greater than 1 is not supported for responses streaming");
```

Suggested change:

```cpp
    return absl::InvalidArgumentError("n greater than 1 is not supported for Responses API streaming");
```
src/llm/apis/openai_completions.cpp
Outdated
```cpp
if (endpoint == Endpoint::RESPONSES) {
    std::vector<ParsedOutput> parsedOutputs;
    for (const auto& tokens : results.tokens) {
        updateUsage(usage, tokens, request.echo);
```

We just recently switched to using PerfMetrics @mzegla. Why can't we use it here?
src/llm/apis/openai_completions.cpp
Outdated
```cpp
events.emplace_back(serializeResponsesEvent([this, &responseId, createdAt](Writer<StringBuffer>& writer) {
    writer.StartObject();
    writer.String("type");
    writer.String("response.created");
```

This might not be the right place for this event to be streamed; please verify against other serving solutions if the OpenAI spec is not specific enough.
mzegla left a comment

It's a big change, let's review it iteratively. Pushing comments only on the logic; did not review the tests just yet.
src/llm/apis/openai_completions.cpp
Outdated
```cpp
    throw std::invalid_argument("Unsupported JSON value type");
}

std::string serializeResponsesEvent(const std::function<void(Writer<StringBuffer>&)>& eventSerializer) {
```

How does it work? Does eventSerializer carry the data itself? It's not clear to me what the output string would be.
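An illustrative analog with std::ostream in place of rapidjson::Writer (this is not the OVMS code, just the callback pattern it appears to use): the caller's lambda carries the data in its captures and writes the payload; the helper only owns the buffer and returns the accumulated string.

```cpp
#include <functional>
#include <sstream>
#include <string>

// Analog of serializeResponsesEvent: the event payload lives entirely in the
// lambda; this wrapper just provides the buffer and collects the output.
std::string serializeEvent(const std::function<void(std::ostream&)>& eventSerializer) {
    std::ostringstream out;
    eventSerializer(out);  // lambda writes the payload into the buffer
    return out.str();
}
```

So `serializeEvent([&](std::ostream& o) { o << "{\"type\":\"response.created\"}"; })` would return exactly that JSON string.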
src/llm/apis/openai_completions.cpp
Outdated
```cpp
void serializeNotSupportedNullField(Writer<StringBuffer>& writer, const char* fieldName) {
    writer.String(fieldName);
    writer.Null();
}

void serializeNotSupportedZeroField(Writer<StringBuffer>& writer, const char* fieldName) {
    writer.String(fieldName);
    writer.Uint64(0);
}

void serializeNotSupportedEmptyArrayField(Writer<StringBuffer>& writer, const char* fieldName) {
    writer.String(fieldName);
    writer.StartArray();
    writer.EndArray();
}
```
src/llm/apis/openai_completions.cpp
Outdated
```cpp
const char* status, const std::string& fullOutputText, bool includeUsage,
const char* incompleteReason, const char* errorMessage, const char* errorCode) const {
```

Lots of parameters. Can we use std::strings and get const char* internally, so it's more straightforward to use?
src/llm/apis/openai_completions.hpp
Outdated
```cpp
std::string serializeStreamingUsageChunk();
std::string serializeStreamingHandshakeChunk();
std::string serializeResponsesStreamingInitEvents();
std::string serializeResponsesFailedEvent(const std::string& errorMessage, const char* errorCode = "server_error");
```

Error code as a string? Maybe we could have an enum here.
```cpp
if (executionContext->apiHandler && executionContext->apiHandler->isStream() && executionContext->endpoint == Endpoint::RESPONSES) {
    std::string failedEvent = executionContext->apiHandler->serializeResponsesFailedEvent(e.what());
    executionContext->response = wrapTextInServerSideEventMessage(failedEvent);
    executionContext->response += wrapTextInServerSideEventMessage("[DONE]");
    cc->Outputs().Tag(OUTPUT_TAG_NAME).Add(new std::string{std::move(executionContext->response)}, iterationBeginTimestamp);
    return absl::OkStatus();
}
return absl::InvalidArgumentError(e.what());
} catch (...) {
    if (executionContext->apiHandler && executionContext->apiHandler->isStream() && executionContext->endpoint == Endpoint::RESPONSES) {
        std::string failedEvent = executionContext->apiHandler->serializeResponsesFailedEvent("Response generation failed");
        executionContext->response = wrapTextInServerSideEventMessage(failedEvent);
        executionContext->response += wrapTextInServerSideEventMessage("[DONE]");
        cc->Outputs().Tag(OUTPUT_TAG_NAME).Add(new std::string{std::move(executionContext->response)}, iterationBeginTimestamp);
        return absl::OkStatus();
    }
```

Are you sure that it will be handled correctly with OK status? I mean, for example, will it exit properly?

Yes, in Responses API streaming, as opposed to chat/completions, error handling is realized through a special event rather than a status code.
```cpp
std::string initEvents = executionContext->apiHandler->serializeResponsesStreamingInitEvents();
if (!initEvents.empty()) {
    executionContext->response = wrapTextInServerSideEventMessage(initEvents);
    cc->Outputs().Tag(OUTPUT_TAG_NAME).Add(new std::string{std::move(executionContext->response)}, iterationBeginTimestamp);
    executionContext->response = "";
}
cc->Outputs().Tag(LOOPBACK_TAG_NAME).Add(new bool{true}, iterationBeginTimestamp);
```

Is that whole thing okay? You schedule and don't read/check output before returning. Aren't we at risk of skipping a chunk, especially if it's generated fast?

Yes, because here we emit Responses API events that don't contain data generated by the pipeline.
```cpp
    return false;
}

static py::object rapidJsonValueToPyObject(const rapidjson::Value& value) {
```

Is that whole thing necessary? Why don't we serialize to string -> pass to Python -> read to JSON?
```cpp
std::string OpenAIChatCompletionsHandler::serializeResponseCreatedEvent(const std::string& responseId, int64_t createdAt) {
    StringBuffer buffer;
    Writer<StringBuffer> writer(buffer);
    writeEventHeader(writer, "response.created");
    writer.String("response");
    serializeResponsesResponseObject(writer, responseId, createdAt, "in_progress", "", false);
    writer.EndObject();
    return buffer.GetString();
}

std::string OpenAIChatCompletionsHandler::serializeResponseInProgressEvent(const std::string& responseId, int64_t createdAt) {
    StringBuffer buffer;
    Writer<StringBuffer> writer(buffer);
    writeEventHeader(writer, "response.in_progress");
    writer.String("response");
    serializeResponsesResponseObject(writer, responseId, createdAt, "in_progress", "", false);
    writer.EndObject();
```

response.created / response.in_progress events serialize a full response object via serializeResponsesResponseObject(), which currently always emits an output array (and may include a placeholder message item). This conflicts with the subsequent emission of response.output_item.added / response.content_part.added events and can lead to duplicated/contradictory stream semantics. Consider making created/in_progress events omit output entirely (or emit an empty output array) and rely on the dedicated output_item/content_part events for initialization.
```cpp
// TODO: error not supported in unary response
writer.String("model");
writer.String(request.model.c_str());
writer.String("status");
writer.String(responseStatus.c_str());
```

serializeResponsesUnaryResponse() does not emit several spec-aligned fields that the new unit tests expect (e.g. "error":null, "previous_response_id":null, "reasoning":null, "user":null). This will cause the added tests (e.g. serializeUnaryResponseForResponses*) to fail and makes unary and streaming response objects structurally inconsistent. Reuse serializeResponsesResponseObject() for unary responses or explicitly add the missing null fields so the unary payload matches the Responses API schema.
```cpp
}

std::string OpenAIChatCompletionsHandler::serializeResponsesStreamingInitEvents() {
    const auto createdAt = std::chrono::duration_cast<std::chrono::microseconds>(created.time_since_epoch()).count();
```

Responses streaming uses createdAt in microseconds (duration_cast<microseconds>) and passes it to created_at in the response object. This makes created_at inconsistent with unary responses (seconds) and can even be larger than completed_at (which is seconds), breaking ordering and likely violating the OpenAI Responses API expectations. Use a consistent unit (typically epoch seconds) for created_at/IDs across unary and streaming, and ensure completed_at is in the same unit family.

Suggested change:

```cpp
    const auto createdAt = std::chrono::duration_cast<std::chrono::seconds>(created.time_since_epoch()).count();
```
```python
+    is_fc_model=True,
+    underscore_to_dot=True,
+),
+"ovms-model-stream-responses": ModelConfig(
```

Maybe ovms-model-responses-stream? More straightforward switching.
```cpp
}
#else
ov::genai::ChatHistory& chatHistory = executionContext->apiHandler->getChatHistory();
constexpr bool add_generation_prompt = true;
```

Suggested change:

```cpp
constexpr bool addGenerationPrompt = true;
```

?
src/llm/apis/openai_completions.cpp
Outdated
```cpp
    writer.String(toolCall.arguments.c_str());
    writer.EndObject();
}
if (!fullOutputText.empty() || responsesState.toolCalls.empty()) {
```

If we assume that when there were tool calls there is no other content, we should document it, for example as a comment here, to be explicit.
```cpp
    }
}

request.chatHistory.last()["content"] = contentText;
```
```cpp
if (input_ids.get_shape().size() != 2)
    throw std::runtime_error("input_ids should have 2 dimensions");
if (input_ids.get_shape()[0] != 1)
    throw std::runtime_error("input_ids should have 1 batch size");
if (input_ids.get_element_type() != ov::element::i64)
    throw std::runtime_error("input_ids should have i64 element type");
```

Does the user see those errors, or are they only in our DEBUG logging? The user does not see input_ids, so it's strictly debugging information.

I assume these exceptions should never happen, but we have the same check in the chat completions part a few lines below.
```cpp
int64_t* inputIdsData = reinterpret_cast<int64_t*>(input_ids.data());
std::vector<int64_t> generatedTokens(inputIdsData, inputIdsData + input_ids.get_shape()[1]);
parsedOutputs.push_back(parseOutputIfNeeded(generatedTokens));
```

Does parseOutputIfNeeded strictly require tokens? Is that why we need to do this entire operation here?
```cpp
const auto& toolCallsArr = deltaObj["tool_calls"];
for (rapidjson::SizeType i = 0; i < toolCallsArr.Size(); ++i) {
    const auto& tc = toolCallsArr[i];
    int tcIndex = tc.HasMember("index") ? tc["index"].GetInt() : 0;
```

Is it possible that for two iterations in the loop we get tcIndex = 0 here and overwrite?

I assume that it's not possible, since BaseOutputParser always includes an id in its wrapDelta methods.
src/llm/apis/openai_completions.cpp
Outdated
```cpp
writer.String("parallel_tool_calls");
writer.Bool(false);
```

We have support for parallel tool calls; why is it hardcoded to false here?
```cpp
if (!responsesState.reasoningText.empty()) {
    writer.StartObject();
    writer.String("id");
    writer.String("rs-0");
```

What is rs-0? Is it duplicated in multi-turn conversation, or is it for cases where the model mixes reasoning and content multiple times within one unary response?
```cpp
writer.StartObject();
writer.String("input_tokens");
writer.Uint64(static_cast<uint64_t>(usage.promptTokens));
// TODO: input_tokens_details.cached_tokens not supported
```

There is a table for the completions and chat completions APIs showing what we support and what we don't. Could we have the same for the Responses API? That way we will be able to talk about it clearly in the future, see what is missing, and estimate next tasks correctly. I'm just afraid people will assume something is done when in fact it's not.
```cpp
} else {
    return absl::InvalidArgumentError("input is not a string or array");
}
```

It's missing my bugfix: https://github.com/openvinotoolkit/model_server/pull/4079/changes
```cpp
auto kwargsIt = doc.FindMember("chat_template_kwargs");
if (kwargsIt == doc.MemberEnd()) {
    rapidjson::Value kwargs(rapidjson::kObjectType);
    kwargs.AddMember("enable_thinking", true, doc.GetAllocator());
```

Barely any model really uses this to enable thinking; Qwen3-VL for example doesn't. Is it an issue? @michalkulakowski @dtrawins
```cpp
std::stringstream ss;
ss << events.front();
for (size_t i = 1; i < events.size(); ++i) {
    ss << "\n\ndata: " << events[i];
```

There's already a util for that (wrapTextInServerSideEventMessage); why can't we reuse it here?
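Assuming wrapTextInServerSideEventMessage frames a single event as `data: <text>` followed by a blank line (an assumption; the real util may differ), the hand-written loop above could become a plain fold over it:

```cpp
#include <string>
#include <vector>

// Assumed behavior of the existing util: frame one SSE event.
std::string wrapTextInServerSideEventMessage(const std::string& text) {
    return "data: " + text + "\n\n";
}

// Joining events by reusing the util instead of hand-writing "\n\ndata: ".
std::string joinEvents(const std::vector<std::string>& events) {
    std::string out;
    for (const auto& event : events)
        out += wrapTextInServerSideEventMessage(event);
    return out;
}
```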